Abstract

Understanding  how  biological  visual  systems  perform  object  recognition  is  one  of  the  ultimate  goals  in 
computational  neuroscience.  Among  the  biological  models  of  recognition  the  main  distinctions  are  be¬ 
tween  feedforward  and  feedback  and  between  object-centered  and  view-centered.  From  a  computational 
viewpoint  the  different  recognition  tasks  —  for  instance  categorization  and  identification  —  are  very  sim¬ 
ilar,  representing  different  trade-offs  between  specificity  and  invariance.  Thus  the  different  tasks  do  not 
strictly  require  different  classes  of  models.  The  focus  of  the  review  is  on  feedforward,  view-based  models 
that  are  supported  by  psychophysical  and  physiological  data. 
1 Introduction

1.1  Object  recognition  is  a  difficult  computational 
problem 

Imagine  waiting  for  incoming  passengers  at  the  arrival 
gate  at  the  airport.  The  small  camera  in  the  buttonhole 
of  your  lapel  looking  at  the  incoming  crowd  suddenly 
tells  you  that  Dr.  Jennings  is  the  third  from  right,  partly 
occluded by a woman in front. Today his tie, the camera
says, shows a pattern of antique cars. Computer
vision is well on its way to solving restricted versions of
the problem of object recognition, both in identification
(recognizing Jennings's specific face) and categorization
(recognizing the patterns on the tie as cars). A
system, however, that was capable of categorizing all
types of objects in complex images, of recognizing individual
objects like faces and of providing human-level
performance under different illuminations and viewpoints
would pass the Turing test for vision. Not surprisingly,
such a general and flexible system is still the
stuff  of  science  fiction.  Object  recognition  is  at  the  top 
of  a  hierarchy  of  visual  tasks.  In  its  general  form,  it  is  a 
very  difficult  computational  problem,  which  is  likely  to 
play  a  significant  role  in  eventually  making  intelligent 
machines.  Not  surprisingly  it  is  an  even  more  difficult, 
open  and  key  problem  for  neuroscience. 

1.2  Multiple  tasks  and  strategies  in  object 
recognition 

As  the  airport  scenario  shows,  an  object  can  be  recog¬ 
nized  at  different  levels  of  specificity.  It  can  be  catego¬ 
rized  as  a  member  of  a  general  class,  such  as  "face"  or 
"car".  It  can  also  be  identified  as  a  unique  individual, 
such as "Jennings's face" or "my car".

Identification  and  categorization  are  the  two  main 
tasks  in  recognition.*  Which  of  the  two  tasks  is  easier 
and  which  comes  first?  The  answers  from  neuroscience 
and  computer  vision  are  strikingly  different.  Typically, 
computer  vision  techniques  found  identification  rela¬ 
tively  easy  —  as  shown  by  the  several  companies  sell¬ 
ing  face  identification  systems  —  and  categorization 
close  to  impossible.  Psychologists  and  neuroscientists 
like  to  tell  the  opposite  story  (see  [51]  and,  for  reviews, 
[24,  49]):  in  biological  visual  systems,  categorization 
seems  to  be  the  simpler  and  more  immediate  stage  in 
the  recognition  process.  In  any  case  it  has  been  com¬ 
mon  in  the  past  few  years  —  in  computer  vision  and 
especially in visual neuroscience — to assume that different
strategies are required for these different recognition
tasks (but see the books by Edelman [10] and Ullman
[49]).


*  Together  with  motor-related  shape  estimation. 


1.3  A  continuum  of  recognition  tasks  along  the 
trade-off  between  specificity  and  invariance 

In  this  review  we  start  from  a  rather  different  Ansatz. 
From  a  theoretical  standpoint,  identification  and  cate¬ 
gorization,  rather  than  two  distinct  tasks,  represent  two 
points in a spectrum of generalization levels [49].† An
appropriate  theoretical  framework  for  object  recogni¬ 
tion  is  computational  learning.  Within  it,  the  distinc¬ 
tion  between  identification  and  categorization  is  mostly 
irrelevant.  The  relevant  variables  are  the  size  of  the 
training  set,  the  universe  of  distractors,  the  number  of 
classes  and  the  "legal"  transformations  allowed  for  gen¬ 
eralization. 

We  believe  that  the  crux  of  the  problem  of  object 
recognition  is  the  trade-off  between  (object,  class)  speci¬ 
ficity  and  (transformation)  invariance  (see  below).  The 
distinction  in  the  literature  between  different  flavours 
of  identification  and  categorization  is  an  idiosyncratic 
tale  of  this  trade-off. 

1.4  The  same  basic  computational  strategy  can  be 
used  from  identification  to  categorization 

From  this  reasoning  we  expect  that  the  same  compu¬ 
tational  strategies  could  be  adapted  to  perform  either 
identification  or  categorization.  Thus,  the  existence  of 
multiple  recognition  tasks  does  not  require  radically 
different  algorithms  or  representations.  Multiple  recog¬ 
nition  strategies  are  nevertheless  very  likely  to  exist  in 
biological  vision,  such  as  the  "immediate",  perceptual 
recognition  based  on  similarity  of  visual  appearance 
as  opposed  to  recognition  based  on  motion  dynamics 
or  reasoning-based  recognition  (e.g.,  interpreting  maps 
or  technical  diagrams).  Thus  we  both  expect  multi¬ 
ple  strategies  (e.g.,  algorithms)  for  the  same  recogni¬ 
tion  task  and  the  same  basic  strategy  for  multiple  tasks 
(e.g.,  identification  and  categorization). 

1.5  Models  and  experiments 

The  main  reason  for  the  lengthy  introduction  is  our  be¬ 
lief  that  models  are  necessary  to  make  sense  of  the  data 
and  more  importantly  to  plan  new  experiments.  With¬ 
out  quantitative  models  (properly  tested),  it  may  be  dif¬ 
ficult  to  ask  the  right  questions. 

Our  brief  review  of  models  of  object  recognition  is 
far  from  comprehensive,  even  within  the  caveats  of 
the  previous  discussion.  For  instance,  we  will  con¬ 
fine  ourselves  to  the  recognition  of  isolated  objects  (for 
some  ideas  on  biological  object  recognition  in  clutter  see 
[1,  36]),  and  will  focus  on  recent  developments  and  on 

†Notice that even identification tasks can differ widely in
difficulties  and  requirements:  consider,  for  instance,  the  prob¬ 
lem  of  identifying  the  image  of  a  specific  old  high-school 
friend  among  all  the  pictures  of  faces  stored  among  the  ter¬ 
abytes  of  the  World  Wide  Web  versus  identifying  who  among 
my  two  siblings  is  in  the  picture  that  my  mother  has  on  her 
coffee  table. 
those models that can be directly related to experimental
physiological data. We show that a mix of models
and  data  has  brought  us  closer  to  understanding  some 
of  the  cortical  mechanisms  of  recognition  and  will  dis¬ 
cuss  key  questions  that  lie  ahead. 

2 Models: Object-Centered and
View-Based, Feedforward and Feedback

The  models  proposed  to  explain  object  recognition  can 
be  coarsely  divided  into  two  categories:  object-centered 
and  view-based  (or  image-based  or  appearance-based). 
In  the  former  group  of  models,  the  recognition  pro¬ 
cess  consists  in  extracting  a  view-invariant  structural 
description  of  the  object  that  is  then  matched  to  stored 
object  descriptions.  One  of  the  most  prominent  models 
of  this  type  is  the  "Recognition-by-Components"  (RBC) 
theory  of  Biederman  [3, 19],  whose  emphasis  on  repre¬ 
senting  an  object  by  decomposing  it  into  basic  geomet¬ 
rical  shapes  is  reminiscent  of  the  scheme  proposed  by 
Marr  and  Nishihara  [25].  RBC  predicts  that  recognition 
of  objects  should  be  viewpoint-invariant  as  long  as  the 
same  structural  description  can  be  extracted  from  the 
different  object  views. 

In  contrast  to  structural  description  models,  the  ba¬ 
sic  tenet  of  view-  or  image-based  models  is  that  objects 
are  represented  as  collections  of  view-specific  features, 
leading  to  recognition  performance  that  is  a  function 
of  previously  seen  object  views.  In  the  following,  the 
term "view" is used in the broad sense of "image-based
appearance".  Thus  different  views  correspond  to  dif¬ 
ferent  appearances,  due,  for  instance,  to  different  view¬ 
points  or  different  illuminations,  or  different  conditions 
such  as  different  facial  expressions.  A  view  is  not  re¬ 
stricted  to  contain  just  2D  information;  it  may  have  3D 
information  as  well,  for  instance  because  of  stereo  or 
structure-from-motion.  A  zoo  of  view-based  models  of 
object  recognition  exists  in  the  literature,  each  employ¬ 
ing  very  different  computational  mechanisms.  Two  ma¬ 
jor  groups  of  models  can  be  discerned  based  on  whether 
they  employ  a  purely  feedforward  model  of  processing 
or  utilize  feedback  connections  (for  the  recognition  pro¬ 
cess,  i.e.,  excluding  a  learning  phase,  in  which  top-down 
teaching  signals  are  likely  to  be  used). 

Feedback  models  include  architectures  that  per¬ 
form  recognition  by  using  an  analysis-by-synthesis  or 
hypothesis-and-test  approach:  the  system  makes  a 
guess  about  the  object  that  may  be  in  the  image,  syn¬ 
thesizes  a  neural  representation  of  it  relying  on  stored 
memories,  measures  the  difference  between  the  halluci¬ 
nation  and  the  actual  visual  input  and  proceeds  to  cor¬ 
rect  the  initial  hypothesis.  The  models  of  Rao  &  Ballard 
[35],  or  of  Mumford  [30],  and  in  part  Ullman's  [49]  be¬ 
long  to  this  category.  Other  models  use  feedback  con¬ 
trol  to  "renormalize"  the  input  image  in  position  and 
scale  before  attempting  to  match  it  to  a  database  of 
stored  objects  (as  in  the  "shifter"  circuit  [2,  31]),  or  to 


conversely  tune  the  recognition  system  depending  on 
the  object's  transformed  state  (for  instance  by  matching 
filter  size  to  object  size  [15]). 

While  feedback  processing  is  essential  for  object 
recognition  in  the  previous  group  of  models,  other 
image-based  models  rely  on  just  feedforward  process¬ 
ing.  One  of  the  earliest  representatives  of  this  class 
of  models  is  the  "Neocognitron"  of  Fukushima  [12],  a 
hierarchical  network  in  which  feature  complexity  and 
(translation)  invariance  were  alternatingly  increased  in 
different  ("S"  and  "C",  resp.)  layers  of  a  processing 
hierarchy  by  a  template  match,  and  a  pooling  opera¬ 
tion  over  units  tuned  to  the  same  feature  but  at  dif¬ 
ferent  positions,  respectively.  The  concept  of  pooling 
of  units  tuned  to  transformed  versions  of  the  same  ob¬ 
ject  or  feature  was  subsequently  proposed  by  Perrett  & 
Oram  to  explain  invariance  also  to  non-affine  transfor¬ 
mations  such  as  invariance  to  rotation  in  depth  or  illu¬ 
mination  changes  [33].  Indeed,  Poggio  &  Edelman  [34] 
had  shown  earlier  that  view-invariant  recognition  of  an 
object  was  possible  by  interpolating  between  a  small 
number  of  stored  views  of  that  object. 

The  strategy  of  using  different  computational  mech¬ 
anisms  to  attain  the  twin  goals  of  invariance  and  speci¬ 
ficity  (as  opposed  to  a  homogeneous  architecture  as 
used  in,  for  example,  Wallis'  and  Rolls'  VisNet  [53]) 
has  been  employed  successfully  in  later  models,  among 
them  Mel's  SEEMORE  system  [26]  that  represented  ob¬ 
jects  by  histograms  over  various  feature  channels,  and 
the  HMAX  model  by  Riesenhuber  and  Poggio  [36,  37, 
45],  whose  structure  is  similar  to  Fukushima's  Neocog¬ 
nitron  with  its  feature  complexity-increasing  "S"  layers 
and  invariance-increasing  "C"  layers.  HMAX,  however, 
uses  a  new  pooling  mechanism  (a  MAX  operation)  to 
increase  invariance  in  the  "C"  layers,  in  which  the  most 
strongly  activated  afferent  determines  the  response  of 
the  pooling  unit,  endowing  the  system  with  the  abil¬ 
ity  to  isolate  the  feature  of  interest  from  non-relevant 
background  and  thus  build  feature  detectors  robust  to 
translation  and  scale  changes,  and  even  clutter  [36]. 
More  complex  features  in  higher  levels  of  HMAX  are 
thus  built  from  simpler  features  with  tolerance  to  de¬ 
formations  in  their  local  arrangement  due  to  the  invari¬ 
ance  properties  of  the  lower  level  units.  In  this  respect, 
HMAX  is  similar  to  (so  far  non-biological)  recognition 
architectures  based  on  feature  trees  which  emphasize 
compositionality  [1]. 
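
The difference between MAX pooling and linear (SUM) pooling described above can be illustrated with a minimal sketch. The one-dimensional signal, the three-sample template, the Gaussian-like template match and all function names are illustrative assumptions; this is not the HMAX implementation of [36, 37].

```python
import math

def s_layer(signal, template):
    """Template match at every position: 1.0 for a perfect match, smaller otherwise."""
    k = len(template)
    responses = []
    for i in range(len(signal) - k + 1):
        d2 = sum((signal[i + j] - template[j]) ** 2 for j in range(k))
        responses.append(math.exp(-d2))
    return responses

def c_max(responses):
    """MAX pooling: the most strongly activated afferent sets the output."""
    return max(responses)

def c_sum(responses):
    """Linear pooling, shown only for comparison."""
    return sum(responses)

def place(pattern, position, length=10):
    """Hypothetical helper: embed a pattern in an otherwise empty signal."""
    signal = [0.0] * length
    for j, value in enumerate(pattern):
        signal[position + j] = value
    return signal

template = [1.0, 0.0, 1.0]
clean = place(template, 1)
shifted = place(template, 6)
cluttered = place(template, 1)
cluttered[7] = cluttered[8] = 0.5  # clutter placed away from the target

max_clean = c_max(s_layer(clean, template))
max_shifted = c_max(s_layer(shifted, template))
max_cluttered = c_max(s_layer(cluttered, template))
sum_clean = c_sum(s_layer(clean, template))
sum_cluttered = c_sum(s_layer(cluttered, template))
```

With these toy inputs the MAX response is identical for the clean, shifted and cluttered signals, whereas the summed response changes as soon as clutter is added.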

2.1  A  basic  module:  feedforward  and  view-based 

We  are  thus  left  with  two  major  fault  lines  running 
through  the  landscape  of  models  of  object  recognition: 
object-centered  vs.  image-based,  and,  within  the  latter 
group,  feedforward  vs.  feedback  models.  How  well  do 
these  different  model  classes  hold  up  to  constraints  de¬ 
rived  from  neurophysiological  data? 

Psychophysical  data  from  humans  [5,  44,  47]  as  well 
as monkeys [22] point to a view-dependence of object
recognition  (for  reviews,  see  [24,  46]).  Interestingly, 
data  from  physiology  also  support  a  view-based  the¬ 
ory:  several  studies  have  previously  shown  that  cells 
in  the  inferotemporal  cortex  (IT)  of  macaque  monkeys 
(an  area  thought  to  be  crucial  for  object  recognition 
[24,  43])  respond  to  views  of  complex  objects,  such  as 
faces  [4,  8].  Logothetis  and  co-workers  [23]  systemati¬ 
cally  studied  the  tuning  properties  of  IT  cells  by  train¬ 
ing  a  monkey  to  perform  recognition  of  "paperclip"  ob¬ 
jects,  strictly  controlling  the  object  views  the  monkeys 
had  been  exposed  to  during  training.  Even  though  the 
monkeys  had  access  to  the  full  3D  shape  description  of 
the  object  (by  presenting  the  object  as  rotating  in  depth 
by  ±10°),  psychophysical  experiments  [22]  showed  (in 
agreement  with  human  studies  [5])  that  recognition  was 
based  around  the  views  seen  during  training.  Even 
more intriguing, when Logothetis et al. recorded from
IT  neurons  of  trained  monkeys  [23],  they  found  cells 
tuned  to  views  of  the  training  objects,  along  with  a  much 
smaller  number  of  view-invariant  neurons  tuned  to  ob¬ 
jects  the  monkey  had  been  trained  to  recognize  from 
any  viewpoint,  as  predicted  by  the  model  of  Poggio 
and  Edelman  [34].  Moreover,  psychophysical  recogni¬ 
tion  performance  and  neuronal  tuning  seemed  to  be  in¬ 
timately  related.  Further  constraining  computational 
models  of  object  recognition  are  findings  from  EEG 
studies  [48]  that  have  shown  that  humans  appear  to  be 
able  to  perform  object  detection  tasks  (such  as  deter¬ 
mining  whether  an  image  contains  an  animal  or  not)  in 
natural  images  within  150  ms,  which  is  on  the  order  of 
the  latency  of  visual  signals  from  primary  visual  cortex 
to  inferotemporal  cortex  [13, 40].  This  does  not  rule  out 
the  use  of  feedback  processing  but  strongly  constrains 
its  role  in  "immediate"  object  recognition. 

In  summary,  the  combined  weight  of  experimental 
data  and  theoretical  work  strongly  suggests  that  feed¬ 
forward  view-based  models  describe  well  one  of  the  ba¬ 
sic  strategies  used  by  the  brain  for  "immediate"  recog¬ 
nition  of  3D  objects.  In  the  rest  of  the  review  we  will 
focus  on  this  class  of  models. 

2.2  Invariance  and  specificity 

The  different  approaches  reviewed  in  the  previous  sec¬ 
tion  illuminate  a  central  issue  in  object  recognition, 
namely  the  invariance-specificity  trade-off:  Recogni¬ 
tion  should  be  tolerant  to  object  transformations  such  as 
scaling,  translation,  or  viewpoint  changes  (and,  for  the 
case  of  categorization,  also  to  shape  variations  within  a 
class),  i.e.,  generalize  over  a  variety  of  image  changes, 
while  at  the  same  time  being  able  to  finely  discriminate 
between  different  objects  (for  identification)  or  differ¬ 
ent  object  classes  (for  categorization).  The  visual  sys¬ 
tem  seems  able  to  satisfy  both  goals  of  specificity  and 
invariance  simultaneously,  but  with  different  degrees 
of  success  depending  on  the  transformation  in  ques¬ 


tion,  as  shown  in  the  object  identification  experiment 
by Logothetis et al. [23]: while their view-tuned IT units
(VTUs) generally showed only narrow (in terms of image
similarity as measured, for instance, by correlation)
invariance for rotation in depth, they showed relatively
great tolerance to changes in stimulus position
and scale ([23], see caption of Fig. 1).

Thus,  not  all  object  transformations  appear  to  be 
treated  equally,  in  agreement  with  computational  con¬ 
siderations.  The  effects  of  affine  transformations  in  the 
image  plane,  such  as  scaling  or  2D  translation,  can  be 
estimated  exactly  from  just  one  object  view.  To  deter¬ 
mine  the  behavior  of  a  specific  object  under  transforma¬ 
tions  that  depend  on  its  3D  shape,  such  as  illumination 
changes  or  rotation  in  depth,  however,  one  view  gen¬ 
erally  is  not  sufficient.  These  fundamental  limitations 
are  borne  out  by  the  observed  invariance  properties  of 
the  view-tuned  IT  neurons  as  described  above:  while 
it  is  possible  to  construct  a  translation-  and  scaling- 
invariant  set  of  features  that  allows  the  system  to  per¬ 
form  position-  and  size-invariant  recognition  of  novel 
objects,  invariance  to  3D-based  transformations  does 
not  transfer  as  freely  but  has  to  be  learned  anew  for 
each  paperclip  ([5,  34])  individually  (for  "nice"  object 
classes  in  which  the  objects  have  a  similar  3D  shape  and 
behave  similarly  under  the  transformation  in  question 
[51],  [52],  invariance  might  in  part  transfer  to  other  class 
members,  see  below).  In  categorization,  generalization 
is  across  members  of  the  class.  Thus,  multiple  exam¬ 
ple  views  are  also  needed  to  capture  the  appearance 
of  multiple  objects.  Unlike  affine  2D  transformations, 
3D  rotations,  as  well  as  illumination  changes  and  shape 
variations  within  a  class,  may  require  multiple  example 
views  during  learning. 

3  Models  of  Object  Recognition:  A 
Summary 

Figure  2  summarizes,  in  an  oversimplified  and  cartoon- 
ish  way,  the  discussion  above,  putting  together  and  ex¬ 
tending  models  such  as  HMAX  [36],  Poggio  &  Edelman 
[34],  Perrett  &  Oram  [33],  the  Neocognitron  [12],  and 
even  VisNet  [53].  An  initial  "view-based  module"  stage 
takes  care  of  the  invariance  to  image-based  transforma¬ 
tions  leading  to  view-tuned  cells  —  several  for  each  ob¬ 
ject.  In  the  following,  with  view-tuned  cells  we  mean 
cells  tuned  to  a  full  or  a  partial  view  (i.e.,  connected 
only  to  a  few  of  the  feature  units  activated  by  the  ob¬ 
ject  view  [36])  of  an  object.  At  higher  stages,  invari¬ 
ance  to  rotation  in  depth  (illumination,  facial  expres¬ 
sion,  etc.)  is  achieved  by  pooling  together  the  view- 
tuned  cells  for  each  object.  Finally,  categorization  and 
identification  tasks,  up  to  the  motor  response  if  neces¬ 
sary,  are  performed  by  circuits  looking  at  the  activities 
of  the  object-specific  and  view-invariant  cells  (or,  in  the 
absence  of  relevant  view-invariant  units,  e.g.,  when  the 
subject  has  only  experienced  an  object  from  a  certain 
viewpoint as in the experiments on paperclip recognition
[5, 22, 23], directly at the view-tuned units, as indicated
by the dashed lines in Fig. 2). In general, a
particular object, say a specific face, will elicit different
activity in the O_n cells tuned to a small number of
"prototypical" faces [39]. Thus the memory of the particular
face is represented in the identification circuit in an
implicit way (i.e., without dedicated "grandmother" cells)
by  a  population  code.  In  a  similar  way,  a  categorization 
module  —  say,  for  dogs  vs.  cats  —  uses  as  inputs  the 
activities  of  a  number  of  cells  tuned  to  various  animals, 
with  weights  set  so  that  the  unit  responds  differently  to 
animals  from  different  classes  [38]. 
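
The pooling of view-tuned cells into a view-invariant cell can be sketched under strong simplifying assumptions. Here a view is reduced to a single rotation angle in degrees, each view-tuned unit has Gaussian tuning around one stored view, and the object-tuned unit sums its afferents with equal weights. The tuning width, the stored views and the function names are hypothetical choices made for illustration, not parameters from [34] or [36].

```python
import math

def view_tuned_unit(angle, preferred, sigma=20.0):
    """Gaussian tuning around one stored view (angles in degrees)."""
    return math.exp(-((angle - preferred) ** 2) / (2.0 * sigma ** 2))

def object_tuned_unit(angle, stored_views, sigma=20.0):
    """Pool the view-tuned units of one object (equal weights assumed)."""
    return sum(view_tuned_unit(angle, v, sigma) for v in stored_views)

stored_views = [-60.0, -30.0, 0.0, 30.0, 60.0]
test_angles = range(-60, 61, 15)

# Response of a single unit tuned to the 0-degree view, and of the pooled unit.
single = [view_tuned_unit(a, 0.0) for a in test_angles]
pooled = [object_tuned_unit(a, stored_views) for a in test_angles]
```

In this sketch the single view-tuned unit responds strongly only near its preferred view, while the pooled unit stays well above zero over the whole range of rotations covered by the stored views.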

4  Beyond  the  Recognition  of  Specific 
Objects:  Object  Classes 
Figure 1: Invariance properties of one neuron (modified from
Logothetis  et  al.  [23]).  The  figure  shows  the  response  of  a 
single  cell  found  in  anterior  IT  after  training  the  monkey  to 
recognize  paperclip-like  objects.  The  cell  responded  selec¬ 
tively  to  one  view  of  a  paperclip  and  showed  limited  invari¬ 
ance  around  the  training  view  to  rotation  in  depth,  along  with 
significant  invariance  to  translation  and  size  changes,  even 
though  the  monkey  had  only  seen  the  stimulus  at  one  position 
and scale during training. (a) shows the response of the cell
to rotation in depth around the preferred view. (b) shows the
cell's  response  to  the  10  distractor  objects  (other  paperclips) 
that  evoked  the  strongest  responses.  The  lower  plots  show 
the  cell's  response  to  changes  in  stimulus  size,  (c)  (asterisk 
shows  the  size  of  the  training  view),  and  position,  (d)  (using 
the  1.9°  size),  resp.,  relative  to  the  mean  of  the  10  best  distrac¬ 
tors.  Defining  "invariance"  as  yielding  a  higher  response  to 
transformed  views  of  the  preferred  stimulus  than  to  distrac¬ 
tor  objects,  neurons  exhibit  an  average  rotation  invariance  of 
42°  (during  training,  stimuli  were  actually  rotated  by  ±15°  in 
depth  to  provide  full  3D  information  to  the  monkey;  there¬ 
fore,  the  invariance  obtained  from  a  single  view  is  likely  to 
be  smaller),  translation  and  scale  invariance  on  the  order  of 
±2°  and  ±1  octave  around  the  training  view,  resp.  (J.  Pauls, 
personal  communication). 


4.1  A  basic  module  for  identification  and 
categorization:  sparse  population  codes 

Most  experimental  studies  of  object  recognition  have  fo¬ 
cussed  on  testing  the  recognition  performance  on  the 
same  (small  number  of)  objects  used  during  training 
[5, 14, 22, 44].  However,  in  everyday  object  recognition, 
the  ability  to  generalize  from  previously  seen  objects  of 
a  class  to  novel  representatives  of  the  same  class,  such 
as  in  the  case  of  faces,  is  essential.  The  difference  in 
the  object  recognition  tasks  —  no  generalization  over 
shape  in  the  first  case  in  favor  of  high  specificity  vs.  the 
ability  to  also  discriminate  between  novel  objects  in  the 
second  case  —  appears  also  to  be  reflected  in  the  neu¬ 
ronal  tuning  of  object-tuned  neurons:  while  Logothetis 
et al. [23] found neurons that were tightly shape-tuned
("grandmother"-like neurons), with a neuron responding
to (a view of) just a single object from the training
set, recent studies of face cells in IT have argued for
a  distributed  representation  of  this  object  class  where 
the  identity  of  a  face  is  jointly  encoded  by  the  activa¬ 
tion  pattern  over  a  group  of  face  units  [54,  55],  corre¬ 
sponding  in  the  model  of  Fig.  2  to  an  activation  pat¬ 
tern  over  view-tuned  (and  object-tuned)  units  belong¬ 
ing  to  different  objects  (none  of  which  in  general  is  iden¬ 
tical  to  the  input  object,  unlike  in  the  "grandmother" 
case).  Discrimination  (or  memorization  of  specific  ob¬ 
jects,  Fig.  3a)  can  then  proceed  by  comparing  activation 
patterns  over  the  relevant  (i.e.,  the  strongly  activated) 
object-  or  view-tuned  units  —  with  the  advantage  that 
for  a  certain  level  of  specificity,  only  the  activations  of 
a  small  number  of  units  have  to  be  remembered,  form¬ 
ing  a  sparse  code  (in  contrast  to  activation  patterns  on 
lower  levels  where  units  are  less  specific  and  hence  ac¬ 
tivation  patterns  tend  to  be  more  distributed).  Recent 
computational  studies  in  our  lab  [39]  have  provided 
evidence  for  the  feasibility  of  such  a  representation, 
with  a  discrimination  performance  comparable  to  that 
achieved  by  dedicated  "grandmother"  units.  An  inter¬ 
esting  and  non-trivial  conjecture  (supported  by  several 
Figure 2: Sketch of a class of models of object recognition, combining and extending models such as Fukushima [12], Poggio
& Edelman [34], Perrett & Oram [33], VisNet [53], and HMAX [37]. View-tuned units (V_n) (on top of a view-based module,
as shown in the inset [37]) in the model exhibit tight tuning to rotation in depth (and illumination, and other object-dependent
transformations such as facial expression etc.) but are tolerant to scaling and translation of their preferred object view. Notice that
the cells labeled here and in Fig. 3 as view-tuned units may be tuned to full or partial views. Invariance to rotation in depth (as
an example of an object-dependent transformation) can then be significantly increased by interpolating between several
view-tuned units tuned to different views of the same object [34], creating view-invariant (or object-tuned) units (O_n). These, as well
as the view-tuned units, can then serve as input to task modules performing visual tasks such as identification/discrimination
or object categorization (see below). Categorization could be supported even just by connections with the cells in the last layer of
the model in the inset. For most object classes, the object-tuned units will be much fewer than the objects that can be identified.
The stages up to the object-centered unit probably encompass V1 to anterior IT (AIT). The last stage of task dependent modules
may be localized in prefrontal cortex (PFC) and beyond (see D.J. Freedman et al., Soc. Neurosci. Abs., 1999).
experiments [9, 28, 39], see also [10]) of this population-based
representation is that it should be capable of generalizing
from a single view of a new object of a nice [51]
class  —  such  as  a  specific  face  —  to  other  views  with  a 
higher  performance  than  for  non-nice  objects  —  such  as 
paperclips. 

The  same  substrate,  a  population-based  class  repre¬ 
sentation,  has  the  nice  property  that  it  can  also  support 
categorization, as sketched in Fig. 3b: for instance, a
"cat/dog categorization unit" can be connected to units
responding  to  "cat"  and  "dog"  prototypes*  in  such  a 
way  that  it  shows  different  response  levels  for  cats  and 
dogs,  respectively,  as  we  have  recently  demonstrated 
[38]. 
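
The cat/dog categorization unit described above can be sketched on top of a small population of prototype-tuned units. The two-dimensional feature space, the prototype coordinates, the Gaussian tuning and the hand-set weights of +1 and -1 are all hypothetical; in [38] the weights are learned rather than fixed by hand.

```python
import math

def tuned_unit(x, prototype, sigma=1.0):
    """Gaussian tuning of one unit around its prototype in feature space."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return math.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical prototypes in an abstract two-dimensional feature space.
cat_prototypes = [(0.0, 1.0), (1.0, 2.0)]
dog_prototypes = [(4.0, 0.0), (5.0, 1.0)]
prototypes = cat_prototypes + dog_prototypes
weights = [1.0, 1.0, -1.0, -1.0]  # assumed: positive for cat units, negative for dog units

def population_code(x):
    """Activation pattern over all prototype-tuned units."""
    return [tuned_unit(x, p) for p in prototypes]

def categorization_unit(x):
    """Weighted sum of the population code; positive suggests 'cat', negative 'dog'."""
    return sum(w * a for w, a in zip(weights, population_code(x)))

novel_cat = (0.5, 1.5)  # matches no stored prototype exactly
novel_dog = (4.5, 0.5)
cat_score = categorization_unit(novel_cat)
dog_score = categorization_unit(novel_dog)
```

Neither novel stimulus coincides with a prototype, yet the activation pattern over the prototype units is enough for the categorization unit to respond with opposite signs. The same pattern could equally be compared across stimuli for identification.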

4.2  A  unified  view 

Thus,  we  see  that  the  task-related  "black  boxes"  at  the 
top  of  Fig.  2  can  possibly  be  realized  as  straightforward 
extensions  of  the  previously  proposed  architecture:  us¬ 
ing  template  match  operations,  units  in  higher  layers 
can  learn  to  perform  categorization  tasks  (Fig.  3b),  or 
to identify individual objects ("my car", Fig. 3a). In all
these  cases,  inputs  to  the  top  level  units  can  be  tuned  to 
full  or  partial  views  (as  illustrated  in  Fig.  3)  or  originate 
from  object-tuned  neurons,  as  described  before.  Thus, 
there  is  a  dissociation  of  the  tuning  to  individual  stimuli 
and  their  labels  in  different  recognition  tasks,  allowing 
the  system,  for  instance,  to  implement  different  catego¬ 
rization  schemes  [38]  and  hierarchies  of  categories  on 
the  same  stimuli. 

An  intriguing  possibility  is  that  these  "top-level" 
units  of  Fig.  3a-b  might  themselves  serve  as  inputs  to 
other  task-related  units,  as  shown  in  Fig.  3c,  e.g.,  when 
learning  additional  hierarchy  levels  in  a  categorization 
scheme. 

5  Challenges  Ahead 

5.1  Top-down  and  role  of  feedback 

In  this  review  we  have  taken  the  view  that  basic  recog¬ 
nition  processes  take  place  in  a  bottom-up  way;  it  is, 
however,  very  likely  that  top-down  signals  play  an  es¬ 
sential  role  in  controlling  the  learning  phase  of  recog¬ 
nition  [18]  and  in  some  top-down  effects  (for  instance 
in  detection  tasks,  to  bias  recognition  towards  the  fea¬ 
tures  of  interest,  as  suggested  by  physiological  studies 
[6,  16,  27,  29]).  There  is  an  obvious  anatomical  sub¬ 
strate  for  top-down  processing:  the  massive  descending 
projections  in  the  visual  cortex  that  tend  to  reciprocate 
the  forward  connections.  Ullman  [49]  suggested  a  role 
in  top-down  processing  for  matching  models  to  inputs 
that  is  symmetric  and  as  important  as  the  bottom-up 
matching  of  inputs  to  models  (see  also  the  Helmholtz 

*Of  course  they  can  also  be  connected  directly  to  earlier 
"components"  units:  categorization  is  known  to  be  sensitive 
to  partial  matches. 


Machine  [7]).  Other  roles  for  top-down  processing  have 
been  proposed  such  as  controlling  attention  [21],  and 
grouping  and  synchronization  of  neural  groups  [41]. 

5.2  Learning 

We  can  learn  to  recognize  a  specific  object  (such  as  a 
new  face)  immediately  after  a  brief  exposure.  The  mod¬ 
els  we  described  predict  that  only  the  last  stages  need 
to  change  their  synaptic  connections  over  a  fast  time 
scale.  Current  psychophysical,  physiological  and  fMRI 
evidence,  however,  suggests  that  learning  takes  place 
throughout the cortex from V1 to IT and beyond. It is
natural  to  assume  that  modifications  of  earlier  layers 
take  place  over  longer  times  and  are  thus  experience- 
dependent  but  less  object-specific.  A  challenge  lies  in 
finding  a  learning  scheme  that  describes  how  input 
stimuli  drive  the  development  of  features  at  lower  lev¬ 
els,  while  at  the  same  time  assuring  that  features  of 
the same type are pooled over in an appropriate fashion
by the pooling units. Hyvärinen and Hoyer [20]
have  recently  presented  a  learning  rule  whose  aim  is 
to  decompose  an  image  into  independent  feature  sub¬ 
spaces.  The  learning  rule  is  similar  to  the  independence 
maximization  rule  with  a  sparsity  prior  used  by  Ol- 
shausen  and  Field  [32]  with  the  difference  that  here  the 
independence  between  the  norms  of  projections  on  lin¬ 
ear  subspaces  is  maximized.  With  this  learning  rule, 
Hyvarinen  and  Hoyer  are  able  to  learn  shift-  and  phase- 
invariant  features  similar  to  complex  cells.  It  remains  to 
be  seen  whether  a  hierarchical  version  of  such  a  scheme 
to  construct  an  increasingly  complex  set  of  features  is 
also  feasible.  Wallis  and  Rolls  [53]  have  studied  learn¬ 
ing  at  all  levels  in  a  model  of  object  recognition  (using  a 
variant  of  Foldiak's  Trace  Rule  [11])  capable  of  recogniz¬ 
ing  simple  configurations  of  bars,  and  even  faces.  Ull¬ 
man  has  suggested  a  computational  scheme  in  which 
features  are  learned  though  the  selection  of  significant 
components  common  to  different  examples  of  a  class 
of  objects  [50].  It  would  be  interesting  to  translate  the 
main  aspects  of  his  proposal  in  a  biologically  plausible 
circuitry. 
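
To make the idea behind a trace rule concrete, the following is a minimal sketch and not the implementation used in [11] or [53]. It assumes a single linear unit, a toy 8-pixel "retina", and our own parametrization: the names `trace_rule_step`, `train`, `alpha` and `memory` are hypothetical. The unit keeps a running average (the trace) of its recent activity, and the Hebbian weight change is gated by that trace rather than by the instantaneous response, so inputs that follow each other in time become associated with the same unit.

```python
import numpy as np

def trace_rule_step(w, x, trace, alpha=0.1, memory=0.8):
    """One update of a trace-style Hebbian rule for a single linear unit.

    memory is the fraction of the previous trace that is retained
    (an assumed parametrization); memory=0 gives a plain Hebbian rule.
    """
    y = float(w @ x)                                # instantaneous response
    trace = memory * trace + (1.0 - memory) * y     # temporally smoothed activity
    w = w + alpha * trace * (x - w)                 # Hebbian step gated by the trace
    return w, trace

def train(frames, w0, alpha=0.1, memory=0.8, sweeps=50):
    """Present the frame sequence repeatedly and return the final weights."""
    w = np.array(w0, dtype=float)
    trace = 0.0
    for _ in range(sweeps):
        for x in frames:
            w, trace = trace_rule_step(w, x, trace, alpha, memory)
    return w

# Toy stimulus: a "bar" sweeping across positions 0..3 of an 8-pixel retina.
frames = np.eye(8)[:4]

# The unit initially responds to position 0 only.
w0 = np.zeros(8)
w0[0] = 0.5

w_trace = train(frames, w0, memory=0.8)   # with temporal trace
w_hebb = train(frames, w0, memory=0.0)    # plain Hebbian rule, no trace

print("with trace   :", np.round(w_trace, 3))
print("without trace:", np.round(w_hebb, 3))
```

In this toy setting the unit trained with a trace acquires weights for all four positions visited by the bar, whereas without the trace it never learns the positions that do not drive it on their own. Positions the bar never visits stay at zero in both cases. Competition between units and weight normalization, which a full model would need, are deliberately left out.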

5.3  The  time  dimension 

The models reviewed here do not explicitly take into account
the fact that the retinal input usually has a time component
to it: objects move and the eyes move, too. In addition, any
neural circuit will have its own time dynamics. However, most
of the models so far are not sufficiently detailed to make
reasonable predictions about such dynamics. Measured neuronal
responses are functions of time, and even for an image
presented in a flash different types of information may be
carried over time [33, 42], or in the time structure of the
neuronal response. Incorporating the time dimension in neuronal
models of recognition is a challenge that is just beginning to
be tackled (see M. Giese, ARVO 2000).
Figure 3: Possible implementations of different recognition tasks in a common computational framework. (a) Memorizing an
individual object by storing the activation pattern of a population of object-/view-tuned units. (b) Learning a categorization
task. Note that in principle both identification and categorization could also be learned using the feature unit inputs directly
(e.g., the C2 units in HMAX [37]). This is especially important for many categorization tasks which are based on components
of the object. Using a more specialized representation, however, such as units tuned to (partial or full views of) the relevant
objects, simplifies the learning task (in addition to other computational advantages such as increased robustness to noise [39]).
Similarly, combining several of these modules in a hierarchy allows the system to exploit prior knowledge in the learning of new
tasks (c), e.g., when using a "cat/dog" categorization unit as input to a "wild cat" unit. In another situation, if the activation
of such a "cat/dog" unit was used in a discrimination task along with the activity pattern over the relevant object-tuned cells,
discrimination of stimulus pairs straddling the boundary would be expected to be facilitated [38]: the classical "Categorical
Perception" effect [17].
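
The scheme in panels (a) and (b) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the models cited in the caption: a hypothetical 2-D feature space, four Gaussian view-tuned units, hand-picked example points, and a least-squares linear readout. Identification stores the population activation pattern of each object; categorization trains a readout on the same population.

```python
import numpy as np

# Hypothetical 2-D feature space standing in for the feature-unit outputs.
# Four view-tuned units: two tuned to cat prototypes, two to dog prototypes.
centers = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
SIGMA = 1.0  # tuning width, chosen arbitrarily for this sketch

def population_response(x):
    """Gaussian (RBF) activation of every view-tuned unit for feature vector x."""
    d2 = np.sum((centers - np.asarray(x, dtype=float)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * SIGMA ** 2))

# (a) Identification: memorize the population activation pattern of each object.
memory = {
    "cat_A": population_response([0.0, 0.2]),
    "cat_B": population_response([0.4, 0.9]),
}

def identify(x):
    """Return the stored object whose activation pattern is closest to the input's."""
    a = population_response(x)
    return min(memory, key=lambda name: np.linalg.norm(memory[name] - a))

# (b) Categorization: a linear readout trained on the same unit activations.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [0.5, 0.5],    # cats, target +1
                    [3.0, 0.0], [3.0, 1.0], [2.5, 0.5]])   # dogs, target -1
train_t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
A = np.array([population_response(x) for x in train_x])
readout, *_ = np.linalg.lstsq(A, train_t, rcond=None)

def categorize(x):
    """Threshold the readout of the population response at zero."""
    return "cat" if population_response(x) @ readout > 0 else "dog"

print(identify([0.05, 0.25]))                            # slightly changed view of cat_A
print(categorize([0.2, 0.3]), categorize([2.8, 0.7]))    # novel cat, novel dog
```

Both tasks read from the same population of tuned units and differ only in the readout, which is the sense in which the caption describes a common computational framework.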
5.4  Some  key  predictions

We label some of the predictions as critical (**) if their
falsification would show that the whole class of models
described here (and summarized in figure 1) is a "no-go".
Experimental evidence against others would falsify specific
models (*).

1**) Several "immediate" recognition tasks (identification
and categorization) mostly use feed-forward connections
during the task itself (possibly not in the learning phase).

2*) Objects of a "nice" class [51] (e.g., objects roughly
sharing a similar 3D structure; faces are the best example)
are represented in terms of a sparse population code, as
activity in a small (hundreds to thousands) set of cells tuned
to prototypes of the class. Objects that do not belong to a
nice class (e.g., paperclips) may need to be represented for
unique identification in terms of a more punctate
representation, similar to a look-up table and requiring, in
the limit, the activity of just a few "grandmother-like" cells.

3*) Identification and categorization circuits may receive
signals from the same or equivalent cells tuned to specific
objects or prototypes. Identification of specific objects
should be more susceptible to damage (for instance by lesions)
than categorization, as identification requires a more specific
discrimination (cf. above).

4*) For objects that are members of a nice class,
generalization from a single view may be better than for other
objects (for non-image plane transformations such as different
illuminations or different viewpoints [28]).